Visit With Us - Travel Package Prediction

Summary

Visit With Us is a company looking to expand its customer base within the tourism sector. It offers travel packages to customers and currently offers five types:

  1. Basic
  2. Standard
  3. Deluxe
  4. Super Deluxe
  5. King

The purchase rate in previous campaigns was approximately 18%. During the last campaign, the company contacted customers at random, without targeting based on information about the customers. This time it is releasing a new product, a Wellness Tourism Package: a travel package designed to help customers kick-start a healthy lifestyle or support their well-being.

Purpose: Visit With Us wants to target advertising for their Wellness Tourism Package to potential customers as a way of increasing their customer base.

Objectives:

Data Dictionary

  1. CustomerID: Unique Customer Id
  2. ProdTaken: Whether customer purchased a package {0:No, 1:Yes}
  3. Age: Age of customer
  4. TypeofContact: How customer was contacted [Company Invited, Self Inquiry]
  5. CityTier: Development of a city, population, facilities, and living standards [Tier 1 > Tier 2 > Tier 3]
  6. Occupation: Customer occupation
  7. Gender: Gender of customer
  8. NumberOfPersonVisiting: Total number of persons taking the trip with the customer
  9. PreferredPopertyStar: Preferred hotel property rating
  10. MaritalStatus: Marital status of customer
  11. NumberOfTrips: Average number of trips in a year by customer
  12. Passport: If customer has a passport {0:No, 1:Yes}
  13. OwnCar: If customer owns a car {0:No, 1:Yes}
  14. NumberOfChidrenVisiting: Total number of children with age less than 5 on trip
  15. Designation: Designation of the customer in the current organization
  16. MonthlyIncome: Gross monthly income of the customer
  17. PitchSatisfactionScore: Sales pitch satisfaction score
  18. ProductPitched: Product pitched by salesperson
  19. NumberOfFollowups: Total number of follow-ups by salesperson after sales pitch
  20. DurationOfPitch: Duration of pitch by salesperson to customer

Data Exploration & Data Cleaning

Initial assessment of the data

Notes

Notes

Notes

Basic Cleaning steps:

There are no strong patterns in the NaN values within the data. Therefore, each feature will likely need to be assessed individually.

Imputation is unlikely to greatly influence any of the features, since none has a proportion of NaNs significantly over 5% of its entries. We still need to check whether correlations with related features could be used to fill the gaps, or whether the median and mode are the best routes.
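A minimal sketch of how the share of missing values per feature could be checked, assuming the data is held in a pandas DataFrame. The helper name nan_share and the toy values are hypothetical, not taken from the original notebook.

```python
import numpy as np
import pandas as pd


def nan_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing entries per column, largest first.

    Columns without any missing values are dropped from the result.
    """
    share = df.isna().mean()
    return share[share > 0].sort_values(ascending=False)


# Toy frame with made-up values; column names follow the data dictionary.
toy = pd.DataFrame(
    {
        "Age": [34, np.nan, 41, 29],
        "TypeofContact": ["Self Inquiry", None, "Company Invited", "Self Inquiry"],
        "Passport": [1, 0, 0, 1],
    }
)

shares = nan_share(toy)
print(shares)
```

On the real dataset, any column whose share sits well above 0.05 would deserve a closer look before a simple fill is applied.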

TypeofContact is the only categorical variable with NaNs; all the other features with NaNs are numerical.

There are no major linear relationships present within the data. The highest is a correlation of 0.46 between Age and MonthlyIncome, which could be worth assessing more carefully.

There are no major correlations among the numerical columns with NaNs. There is a very slight one between Age and MonthlyIncome, but nothing likely to help dramatically with filling in NaNs.

NaNs will be replaced with the median or the mode of the column, depending on whether the data is numerical or categorical.
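The median/mode substitution described above could look roughly like the following. This is a sketch under the assumption that the data is a pandas DataFrame; impute_median_mode is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd


def impute_median_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs with the column median (numerical) or mode (categorical)."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            fill = out[col].median()
        else:
            # mode() can return several values on ties; take the first one
            fill = out[col].mode(dropna=True).iloc[0]
        out[col] = out[col].fillna(fill)
    return out


# Toy frame with made-up values; column names follow the data dictionary.
toy = pd.DataFrame(
    {
        "Age": [30.0, np.nan, 40.0, 50.0],
        "TypeofContact": ["Self Inquiry", None, "Company Invited", "Self Inquiry"],
    }
)

clean = impute_median_mode(toy)
print(clean)
```

The median is used for numerical columns because it is less sensitive to outliers than the mean, which matters for skewed features such as income.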

All NaNs have now been replaced in the dataset.

Examination for discrepancies in data entry

Notes:

Notes:

Notes

Notes

Adjust data types on dataset to be more appropriate

Notes

Exploratory Data Analysis

Aspects to explore and assess within the data:

Features can be grouped into three types:

All of these features are potentially relevant, but grouping them this way helps when exploring the data for potential trends, in part because it limits the number of figures viewed at any one time.

Assess the target variable

Customers only take a product 18.8% of the time. Out of 4888 customers only 920 purchased a travel package.
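The purchase rate follows directly from the two counts above. The short sketch below redoes the arithmetic and also shows the accuracy of a trivial model that predicts "no purchase" for every customer, which is one reason accuracy alone is a weak yardstick on this dataset.

```python
purchased = 920   # customers with ProdTaken == 1 (from the text above)
total = 4888      # all customers in the dataset

purchase_rate = purchased / total

# A model that always predicts "no purchase" is right for every non-buyer,
# yet it finds none of the buyers (recall of 0).
always_no_accuracy = 1 - purchase_rate

print(f"Purchase rate: {purchase_rate:.1%}")
print(f"Accuracy of always predicting 'no': {always_no_accuracy:.1%}")
```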

Notes

There is no dramatic clustering in any of the numerical features that separates customers who purchased a product from those who did not. Interestingly, the univariate distributions show that customers who took a package are distributed essentially the same as those who did not, just at a lower frequency, which makes sense as most customers do not take a package. There are some potential differences to explore, demonstrated by the trend lines in these plots:

You can observe other trends unrelated to whether a customer purchased or not. Unsurprisingly, number of children visiting and number of people visiting appear to be positively correlated: children are still people, so they are counted in both. There is also a positive trend between Age and MonthlyIncome, but the relationship is not actually linear. Age only appears to set a lower bound on income. Essentially, if you are older, your minimum income is likely to be higher, but there appears to be no upper bound on monthly income related to age.

Notes

We observe the same effect for the sales features as we saw for the customer features. No single numerical feature of the sale produces an obvious shift in the likelihood of a customer purchasing a travel package. There is an interesting overall pattern, though, in the position of the trend lines for customers who purchased a plan versus those who did not. The trend line for customers who bought plans sits slightly above the line for those who did not on all sales metrics. The difference is slight but consistent, so small increases in the sales metrics may lead to a small increase in sales.

Notes

Notes

Notes

Notes

Notes

Notes

Notes

Notes

Notes

Modeling

The target, ProdTaken, is a binary category and the features are not all numerical, so a classification model is required. We need to examine whether logistic regression may be suitable for the target ProdTaken. If not, Decision Tree models will work best for this dataset.

Notes

The trends that logistic regression relies on are either absent or weak across the variables. Decision Tree models will be used for this analysis.

Target Metric for models

Recall is the primary target metric for these models. Optimizing for recall prioritizes reaching every customer who is likely to purchase a plan, rather than missing some of those potential customers in exchange for greater confidence that each customer we contact will buy.
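To make the trade-off concrete, the sketch below computes recall and precision from confusion-matrix counts. The counts are made up for illustration and are not results from this analysis.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual buyers the model identified: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of predicted buyers who actually bought: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts: 100 real buyers, of whom the model flags 80,
# while also flagging 120 customers who do not buy.
tp, fn, fp = 80, 20, 120

model_recall = recall(tp, fn)
model_precision = precision(tp, fp)

print(f"Recall: {model_recall:.2f}, precision: {model_precision:.2f}")
```

A model like this one misses few buyers (high recall) at the cost of contacting many customers who will decline (low precision), which is the trade-off accepted here.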

Decision Tree Model

Notes

Notes

Notes

Notes

Bagging Ensemble

Notes

Notes

Notes

Random Forest Ensemble

Notes

Notes

Notes

AdaBoost Classifier

Notes

Notes

Gradient Boost

Notes

Notes

XGBoost Classifier

Notes

Notes

Stacking Model

Examine models up to this point to decide which to stack

For each type of model that can be used in stacking, the versions with the least overfitting are:

These models will be included in the stacking model with the XGBoost as the final estimator.
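A rough sketch of how such a stacking model could be assembled with scikit-learn. Several things here are assumptions: the base estimators and their settings are placeholders (the selected models are not listed in this text), the data is synthetic, and GradientBoostingClassifier stands in for XGBoost so the example depends only on scikit-learn. Passing xgboost.XGBClassifier as final_estimator would match the description above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data (about 19% positives) as a stand-in for the real set.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.81, 0.19], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=1
)

# Placeholder base models; the notebook's tuned models would go here.
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=1)),
    ("forest", RandomForestClassifier(n_estimators=50, max_depth=4, random_state=1)),
]

# The final estimator learns from the cross-validated predictions of the base models.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=GradientBoostingClassifier(random_state=1),  # stand-in for XGBoost
    cv=5,
)
stack.fit(X_train, y_train)

train_recall = recall_score(y_train, stack.predict(X_train))
test_recall = recall_score(y_test, stack.predict(X_test))
print(f"Train recall: {train_recall:.3f}, test recall: {test_recall:.3f}")
```

Comparing the train and test recall printed at the end is the same overfitting check used to pick the base models.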

Notes

Comparison of all models' performance

What models worked the best?

Of all the models, the Decision Tree Tuned had the greatest recall, at 78.5%. It overfits slightly, with a training recall of 83.6%, but the gap is small compared to most models. It does lag behind others in Accuracy (73.4%) and Precision (39.4%). In other words, the model flags many customers who will not purchase a package, but it is better than most at identifying those who will.
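The reported figures can be put in context with a little arithmetic. The sketch below uses only numbers quoted in this report. It assumes the 78.5% recall and 39.4% precision were measured on held-out data, and it compares precision against the overall purchase rate, which may differ slightly from the rate in the held-out split.

```python
train_recall = 0.836          # Decision Tree Tuned, training data
heldout_recall = 0.785        # assumed to be the held-out (test) recall
heldout_precision = 0.394     # assumed to be the held-out (test) precision
base_purchase_rate = 920 / 4888  # purchase rate under random contacting

# Overfitting gap: how much recall drops from training to held-out data.
recall_gap = train_recall - heldout_recall

# Lift: how much more often a contacted customer buys compared with
# contacting customers at random.
lift = heldout_precision / base_purchase_rate

print(f"Recall gap: {recall_gap:.3f}")
print(f"Lift over random contacting: {lift:.2f}x")
```

So even with a precision under 40%, targeting with this model would roughly double the share of contacted customers who buy, relative to the untargeted campaign.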